Abstract

Graph inference is a burgeoning field in the applied and theoretical statistics communities,
as well as throughout the wider world of science, engineering, business, etc. Given two graphs
on the same number of vertices, the graph matching problem is to find a bijection between
the two vertex sets which minimizes the number of adjacency disagreements between the
two graphs. The seeded graph matching problem is the graph matching problem with an
additional constraint that the bijection assigns some particular vertices of one vertex set to
respective particular vertices of the other vertex set. Solving the (seeded) graph matching
problem will enable methodologies for many graph inference tasks, but the problem is NP-hard.
We modify the state-of-the-art approximate graph matching algorithm of Vogelstein et
al. (2012) to make it a fast approximate seeded graph matching algorithm. We demonstrate
the effectiveness of our algorithm - and the potential for dramatic performance improvement
from incorporating just a few seeds - via simulation and real data experiments.

1 Introduction



All graphs in this manuscript are simple; that is, the edges are undirected and there are no loops 
or multiple edges. In other words, the adjacency matrices for graphs are binary, symmetric, and 
hollow. This restriction to simple graphs is for convenience only; indeed, all of our work and 
analysis can be naturally extended to settings with more general graphs. 

Suppose $G_1$ and $G_2$ are two graphs with respective vertex sets $V_1$ and $V_2$ such that $|V_1| = |V_2|$.
For any bijective function $\phi : V_1 \to V_2$, the number of adjacency disagreements under $\phi$ is defined
to be

$$d(\phi) := \left|\left\{(u,v) \in V_1 \times V_1 : [u \sim_{G_1} v \text{ and } \phi(u) \not\sim_{G_2} \phi(v)] \text{ or } [u \not\sim_{G_1} v \text{ and } \phi(u) \sim_{G_2} \phi(v)]\right\}\right|.$$

The graph matching problem is to minimize $d(\phi)$ over all bijective functions $\phi : V_1 \to V_2$. This
problem is NP-hard; in fact, even the simpler problem of deciding whether there exists a graph
isomorphism between $G_1$ and $G_2$ is notoriously of unknown complexity (and, indeed, is suspected
to belong to an intermediate complexity class which is strictly between P and NP-complete). Thus,
in particular, there are no efficient algorithms known for graph matching, and it is suspected that
none exist.
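
The disagreement count $d(\phi)$ defined above can be evaluated directly from adjacency matrices. The following is a minimal sketch, assuming 0-based vertex labels and a bijection stored as an integer array; the function name `adjacency_disagreements` is ours, chosen for illustration. Because the definition ranges over ordered pairs $(u,v)$, each disagreeing undirected edge is counted twice.

```python
import numpy as np

def adjacency_disagreements(A, B, phi):
    """Count ordered pairs (u, v) whose adjacency in G1 differs from
    the adjacency of (phi(u), phi(v)) in G2.

    A, B : (N, N) binary adjacency matrices of G1 and G2.
    phi  : length-N integer array; phi[u] is the vertex of G2 matched to u.
    """
    A = np.asarray(A)
    B = np.asarray(B)
    phi = np.asarray(phi)
    B_matched = B[np.ix_(phi, phi)]  # entry (u, v) equals B[phi[u], phi[v]]
    return int(np.sum(A != B_matched))

# G1 is the path 0-1-2; G2 is the path 1-0-2 (same shape, vertices 0 and 1 swapped).
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])
B = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
print(adjacency_disagreements(A, B, np.array([0, 1, 2])))  # 4
print(adjacency_disagreements(A, B, np.array([1, 0, 2])))  # 0
```

Minimizing this quantity over all bijections is the hard part; the function only scores one candidate.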

The development of graph matching heuristics is a venerable and active field. An excellent
survey article by Conte, Foggia, Sansone, and Vento titled "Thirty years of graph matching in
pattern recognition" [2] outlines successful application of approximate graph matching to
two-dimensional and three-dimensional image analysis, document processing, biometric identification,
image databases, video analysis, and biological and biomedical applications. The current
state-of-the-art algorithms can provide effective and real-time approximate graph matching for graphs
with hundreds or thousands of vertices [7].

In this manuscript, we utilize the approximate graph matching algorithm of Vogelstein et
al. [7] which they call "FAQ" (an acronym for Fast Approximate Quadratic Assignment Problem
Algorithm); its running time is cubic in the number of vertices and, in practice, the quality of the
approximate solution and the speed of the algorithm are state-of-the-art. The relevant details of
FAQ will be specified later, in Section 2.

Now consider that we are also given subsets $W_1 \subset V_1$, $W_2 \subset V_2$ such that $|W_1| = |W_2|$ and we
are given a fixed bijection $\psi : W_1 \to W_2$. The seeded graph matching problem is defined to be the
problem of minimizing $d(\phi)$ over all bijections $\phi : V_1 \to V_2$ that are extensions of $\psi$ - that is, $\phi$
must agree with $\psi$ on $W_1$ (i.e., for all $u \in W_1$, $\phi(u) = \psi(u)$). The elements of $W_1$ are called seeds
and the bijection $\psi$ is a seeding. In Section 2 we modify the approximate graph matching FAQ
algorithm for use in approximate seeded graph matching.

When we say that $G_1$ on vertex set $V_1$, and $G_2$ on vertex set $V_2$ are random graphs drawn from
the same distribution, with correspondence function $\Psi$ (for the bijective function $\Psi : V_1 \to V_2$),
we mean that there are specified probabilities for each of the $2^{\binom{|V_1|}{2}}$ possible graphs on the vertex
set $V_1$ and, from this probability distribution, the two graphs $G_1$ and $G_2$ are realized (perhaps
independently, or perhaps with dependence) and then - just in $G_2$ - each vertex $u \in V_1$ is relabeled
as $\Psi(u) \in V_2$, so that $G_1$ remains on vertex set $V_1$ but $G_2$ now has vertex set $V_2$. The approximate
graph matching solution $\phi : V_1 \to V_2$ may be viewed as an approximation for the underlying
correspondence function $\Psi : V_1 \to V_2$, if $\Psi$ is partially or completely unknown.

We will see in Section 3 that the minimizing $\phi$ may be a poor approximation for $\Psi$, perhaps
agreeing with $\Psi$ at only a few vertices, not much better than chance. However, we will also see
that utilizing seeds $W_1 \subset V_1$ - and the seeding function $\psi$ consisting of the restriction of $\Psi$ to $W_1$
- can yield an approximate seeded graph match solution which agrees with $\Psi$ on a much more
substantial fraction of the nonseeded vertices from $V_1$.

While the literature on graph matching is vast, with [2] and [7] providing a comprehensive
survey (2004) and a recent literature review (2012), respectively, there is precious little prior
art for seeded graph matching: in pE] a very small seeded graph matching problem (12 vertices)
is addressed, while [8] and [5] incorporate constraints that enforce correspondences to be only
between vertices of the same "type".

The structure of this paper is as follows: in Section 2 we adapt the FAQ algorithm of [7] into an
algorithm for approximate seeded graph matching; in Section 3 we demonstrate the effectiveness
of our algorithm - and the potential for dramatic performance improvement from incorporating
just a few seeds - via simulation and three real data experiments; we conclude in Section 4 with
a discussion of implications and future work.



2 Modified-FAQ for approximate seeded graph matching

We are interested in solving the seeded graph matching problem but, as discussed earlier, this 
problem is NP-hard and so we have no expectation that there even exists an efficient algorithm. 
Thus we seek an approximate solution that can be efficiently computed. 



In Section 2.1 we express the seeded graph matching problem as an optimization problem with
integer constraints, and then we relax the integer constraints by replacing them with nonnegativity
constraints. In Section 2.2 we modify the FAQ algorithm of [7] into an algorithm that approximately
solves the relaxed seeded graph matching problem. Of course, when solving a relaxation,
the solution may not in general be integer valued, and as such it would not be appropriate even as
an approximate solution to the original (unrelaxed) problem. Therefore, in Section 2.3, we project
the solution of the relaxed optimization problem to the nearest element of the feasible region of
the unrelaxed problem, and we then declare that to be the approximate solution to the original
(unrelaxed) problem.

2.1 The relaxation 

Recall that $G_1$ is a graph on vertex set $V_1$ and $G_2$ is a graph on vertex set $V_2$ such that $|V_1| = |V_2| =
n + m$, the set of seeds $W_1$ is a subset of $V_1$ and $W_2$ is a subset of $V_2$ such that $|W_1| = |W_2| = m$,
and the bijection $\psi : W_1 \to W_2$ is the seeding. Without loss of generality, we will take $V_1$ and
$V_2$ to each be the set of integers $\{1, 2, \ldots, m+n\}$, we will take $W_1$ and $W_2$ to each be the set
of integers $\{1, 2, \ldots, m\}$, and we will take $\psi$ to be the identity function, for some fixed nonnegative
integer $m$ and positive integer $n$. (When $m = 0$ we have the (unseeded) graph matching problem.)
Let $A, B \in \mathbb{R}^{(m+n)\times(m+n)}$ be adjacency matrices for $G_1$ and $G_2$, respectively; this means that
for all $i,j \in \{1, 2, \ldots, m+n\}$ it holds that $a_{ij} = 1$ or $0$ according as $i \sim_{G_1} j$ or not, and $b_{ij} = 1$
or $0$ according as $i \sim_{G_2} j$ or not. It will soon be useful to let $A$ and $B$ be partitioned as

$$A = \begin{bmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{bmatrix}, \qquad B = \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix},$$

where $A_{11}, B_{11} \in \mathbb{R}^{m\times m}$, $A_{12}, B_{12} \in \mathbb{R}^{m\times n}$, $A_{21}, B_{21} \in \mathbb{R}^{n\times m}$, and $A_{22}, B_{22} \in \mathbb{R}^{n\times n}$.

It is clear that the seeded graph matching problem is minimize $\|A - (I_{m\times m} \oplus P)B(I_{m\times m} \oplus P)^T\|_1$
over all $n \times n$ permutation matrices $P$, where $I_{m\times m}$ is the m-by-m identity matrix, $\oplus$ is the
direct sum of matrices, and $\|\cdot\|_1$ is the $\ell_1$ vector norm on matrices; say the optimal $P$ is $\hat{P}$.
Then the corresponding bijection $\phi_{\hat{P}} : \{1, 2, \ldots, m+n\} \to \{1, 2, \ldots, m+n\}$ defined as, for all
$i \in \{1, 2, \ldots, m\}$, $\phi_{\hat{P}}(i) = i$ and, for all $i,j \in \{1, 2, \ldots, n\}$, $\phi_{\hat{P}}(i+m) = j+m$ precisely when
$\hat{p}_{ij} = 1$, is the bijection which solves the seeded graph matching problem.
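
To make the matrix formulation concrete, here is a minimal sketch that builds $I_{m\times m} \oplus P$, evaluates the $\ell_1$ objective, and reads the bijection off a permutation matrix. The function names are hypothetical and vertices are 0-based, so the seeds are vertices 0 through m-1.

```python
import numpy as np

def direct_sum_with_identity(P, m):
    """Return the block-diagonal matrix I_{m x m} (+) P."""
    n = P.shape[0]
    M = np.zeros((m + n, m + n))
    M[:m, :m] = np.eye(m)
    M[m:, m:] = P
    return M

def seeded_l1_objective(A, B, P, m):
    """Entrywise l1 norm of A - (I (+) P) B (I (+) P)^T."""
    M = direct_sum_with_identity(P, m)
    return float(np.abs(A - M @ B @ M.T).sum())

def bijection_from_permutation(P, m):
    """phi[i] = i for seeds; phi[i + m] = j + m when P[i, j] == 1."""
    return np.concatenate([np.arange(m), m + np.argmax(P, axis=1)])

# One seed (vertex 0). G1 has the single edge 0-1, G2 has the single edge 0-2.
A = np.array([[0, 1, 0], [1, 0, 0], [0, 0, 0]])
B = np.array([[0, 0, 1], [0, 0, 0], [1, 0, 0]])
swap = np.array([[0, 1], [1, 0]])
print(seeded_l1_objective(A, B, np.eye(2), m=1))  # 4.0
print(seeded_l1_objective(A, B, swap, m=1))       # 0.0
print(bijection_from_permutation(swap, m=1))      # [0 2 1]
```

For binary matrices the objective equals the number of ordered-pair adjacency disagreements under the bijection returned by `bijection_from_permutation`.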

Of course, this optimization problem is equivalent to minimizing $\|A(I_{m\times m} \oplus P) - (I_{m\times m} \oplus P)B\|_1$
or $\|A - (I_{m\times m} \oplus P)B(I_{m\times m} \oplus P)^T\|_2$ or $\|A(I_{m\times m} \oplus P) - (I_{m\times m} \oplus P)B\|_2$, over all permutation matrices
$P$, where $\|\cdot\|_2$ is the $\ell_2$ vector norm on matrices. Expanding $\|A - (I_{m\times m} \oplus P)B(I_{m\times m} \oplus P)^T\|_2^2 =
\|A\|_2^2 + \|B\|_2^2 - 2 \cdot \mathrm{trace}\, A^T(I_{m\times m} \oplus P)B(I_{m\times m} \oplus P^T)$, we see that this optimization problem is
equivalent to maximizing $\mathrm{trace}\, A^T(I_{m\times m} \oplus P)B(I_{m\times m} \oplus P^T)$ over permutation matrices $P$ (see footnote 1).
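
The expansion of the squared norm can be spelled out as follows. This is a short worked derivation, writing $M := I_{m\times m} \oplus P$ as shorthand (our notation) and using only the fact that $M$ is a permutation matrix, so $M^T M = I$.

```latex
\begin{aligned}
\|A - MBM^T\|_2^2
  &= \operatorname{trace}\,(A - MBM^T)^T (A - MBM^T) \\
  &= \operatorname{trace}\, A^T A
     + \operatorname{trace}\, M B^T M^T M B M^T
     - 2\,\operatorname{trace}\, A^T M B M^T \\
  &= \|A\|_2^2 + \operatorname{trace}\, B^T B
     - 2\,\operatorname{trace}\, A^T M B M^T \\
  &= \|A\|_2^2 + \|B\|_2^2
     - 2\,\operatorname{trace}\, A^T (I_{m\times m} \oplus P)\, B\, (I_{m\times m} \oplus P^T).
\end{aligned}
```

The first two terms do not depend on $P$, so minimizing the norm amounts to maximizing the trace term.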

1 Note that although $A$ and $B$ are symmetric matrices, we nonetheless keep transposes in place wherever they
are present to enable further generalization; our analysis will not change if we instead were in a broader setting
where $A$ and $B$ are generic (nonsymmetric, nonhollow, and/or nonintegral) matrices in $\mathbb{R}^{(m+n)\times(m+n)}$.

As mentioned previously, graph matching is NP-hard, so we do not expect to ever find an
efficient algorithm for seeded graph matching. In looking for an approximate solution for seeded
graph matching, it makes sense to work with a relaxation; specifically, we concern ourselves with
first solving maximize $\mathrm{trace}\, A^T(I_{m\times m} \oplus P)B(I_{m\times m} \oplus P^T)$ over all doubly stochastic matrices $P$,
which means that $P \in \mathbb{R}^{n\times n}$ such that $P 1_n = 1_n$, $P^T 1_n = 1_n$, and $P \geq 0_{n\times n}$ coordinatewise, where
$0_{n\times n}$ is the n-by-n matrix of zeros and $1_n$ is the n-vector of all ones. Indeed, this is a relaxation
of seeded graph matching in the sense that if we were to add integrality constraints - that $P$ is
integer-valued - then we would precisely return to the constraint that $P$ is a permutation matrix,
hence we would have returned to the seeded graph matching problem.
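
The relationship between the two feasible sets can be checked numerically. The sketch below, with hypothetical helper names and an arbitrary tolerance, tests the doubly stochastic constraints and the additional integrality constraint.

```python
import numpy as np

def is_doubly_stochastic(P, tol=1e-9):
    """True if P is square, nonnegative, and has all row and column sums equal to 1."""
    P = np.asarray(P, dtype=float)
    if P.ndim != 2 or P.shape[0] != P.shape[1]:
        return False
    ones = np.ones(P.shape[0])
    return bool(np.all(P >= -tol)
                and np.allclose(P @ ones, ones, atol=tol)
                and np.allclose(P.T @ ones, ones, atol=tol))

def is_permutation_matrix(P, tol=1e-9):
    """True if P is doubly stochastic and integer-valued."""
    P = np.asarray(P, dtype=float)
    return is_doubly_stochastic(P, tol) and bool(np.allclose(P, np.round(P), atol=tol))

identity = np.eye(2)
swap = np.array([[0.0, 1.0], [1.0, 0.0]])
mixture = 0.5 * identity + 0.5 * swap   # a convex combination of permutation matrices
print(is_doubly_stochastic(mixture), is_permutation_matrix(mixture))
```

The mixture satisfies the relaxed constraints but not the integrality constraint, which is exactly the kind of point the relaxation admits.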

2.2 Modified-FAQ 

The modified-FAQ algorithm is a modification of the state-of-the-art graph matching algorithm of
[7], which they call FAQ, so that it can be used for approximate seeded graph matching.
Modified-FAQ approximately solves the relaxed seeded graph matching problem - maximize
$\mathrm{trace}\, A^T(I_{m\times m} \oplus P)B(I_{m\times m} \oplus P^T)$ subject to $P$ being a doubly stochastic matrix - by using the Frank-Wolfe Method,
which is an iterative procedure that involves successively solving linearizations. It turns out that
the linearizations can be cast as linear assignment problems that are solved with the Hungarian
Algorithm.

We first briefly review the Frank-Wolfe Method before proceeding to apply it. The general
kind of optimization problem for which the Frank-Wolfe Method is used is

(FWP) Minimize $f(x)$ such that $x \in S$, (1)

where $S$ is a polyhedral set (i.e., is described by linear constraints) in a Euclidean space of some
dimension, and the function $f : S \to \mathbb{R}$ is continuously differentiable. A starting point $x^{(1)} \in S$ is
chosen in some fashion, perhaps arbitrarily. For $i = 1, 2, 3, \ldots$, the following is done. The function
$\tilde{f}^{(i)} : S \to \mathbb{R}$ is defined to be the first order (i.e., linear) approximation to $f$ at $x^{(i)}$ - that is,
$\tilde{f}^{(i)}(x) := f(x^{(i)}) + \nabla f(x^{(i)})^T (x - x^{(i)})$; then solve the linear program: minimize $\tilde{f}^{(i)}(x)$ such that
$x \in S$ (this can be done efficiently since it is a linear objective function with linear constraints,
and note that, by ignoring additive constants, the objective function of this subproblem can be
abbreviated: minimize $\nabla f(x^{(i)})^T x$ such that $x \in S$), say the solution is $\tilde{x}^{(i)} \in S$. Now, the point
$x^{(i+1)} \in S$ is defined as the solution to: minimize $f(x)$ such that $x$ is on the line segment from
$x^{(i)}$ to $\tilde{x}^{(i)}$ in $S$. (This is just a one dimensional optimization problem; in the case where $f$ is
quadratic this can be exactly solved analytically.) Go to the next $i$, and terminate this iterative
procedure when the sequence of iterates $x^{(1)}, x^{(2)}, x^{(3)}, \ldots$ stops changing much or develops a
gradient close enough to zero. This concludes our review of the Frank-Wolfe Method.
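
As an illustration of the iteration just reviewed, the sketch below runs the Frank-Wolfe Method on a toy problem of our own choosing (not one from this paper): minimize $f(x) = \|x - t\|^2$ over the probability simplex. The simplex is picked only because its linear subproblem is trivial, since a linear function over the simplex is minimized at the vertex with the smallest gradient coordinate, and because $f$ is quadratic, so the line search has a closed form.

```python
import numpy as np

def frank_wolfe_simplex(target, num_iters=500, tol=1e-12):
    """Minimize f(x) = ||x - target||^2 over the probability simplex."""
    target = np.asarray(target, dtype=float)
    k = len(target)
    x = np.full(k, 1.0 / k)              # starting point in S
    for _ in range(num_iters):
        grad = 2.0 * (x - target)        # gradient of f at the current iterate
        # Linear subproblem: minimize grad^T y over the simplex (a vertex is optimal).
        vertex = np.zeros(k)
        vertex[np.argmin(grad)] = 1.0
        direction = vertex - x
        curvature = 2.0 * (direction @ direction)
        if curvature < tol:
            break
        # Exact line search: f(x + g * direction) is quadratic in g on [0, 1].
        step = np.clip(-(grad @ direction) / curvature, 0.0, 1.0)
        if step * np.linalg.norm(direction) < tol:
            break                        # iterates have stopped changing
        x = x + step * direction
    return x

target = np.array([0.2, 0.5, 0.3])
x_hat = frank_wolfe_simplex(target)
print(x_hat, np.linalg.norm(x_hat - target))
```

Every iterate is a convex combination of points of the simplex, so feasibility is maintained without any projection step.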

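We now describe how modified-FAQ employs the Frank-Wolfe Method to solve the relaxed
seeded graph matching problem. The objective function here is

$$\begin{aligned}
f(P) &= \mathrm{trace}\, A^T (I_{m\times m} \oplus P) B (I_{m\times m} \oplus P^T) \\
&= \mathrm{trace} \begin{bmatrix} A_{11}^T & A_{21}^T \\ A_{12}^T & A_{22}^T \end{bmatrix} \begin{bmatrix} I_{m\times m} & 0_{m\times n} \\ 0_{n\times m} & P \end{bmatrix} \begin{bmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{bmatrix} \begin{bmatrix} I_{m\times m} & 0_{m\times n} \\ 0_{n\times m} & P^T \end{bmatrix} \\
&= \mathrm{trace}\, A_{11}^T B_{11} + \mathrm{trace}\, A_{21}^T P B_{21} + \mathrm{trace}\, A_{12}^T B_{12} P^T + \mathrm{trace}\, A_{22}^T P B_{22} P^T \\
&= \mathrm{trace}\, A_{11}^T B_{11} + \mathrm{trace}\, P^T A_{21} B_{21}^T + \mathrm{trace}\, P^T A_{12}^T B_{12} + \mathrm{trace}\, A_{22}^T P B_{22} P^T,
\end{aligned}$$

which has gradient

$$\nabla(P) := A_{21} B_{21}^T + A_{12}^T B_{12} + A_{22} P B_{22}^T + A_{22}^T P B_{22}.$$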
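
The gradient formula can be sanity-checked against finite differences. The sketch below is an illustration under our own naming: `objective` evaluates the trace in its direct-sum form, `gradient` uses the block formula, and `numerical_gradient` perturbs one entry of $P$ at a time. Generic (nonsymmetric) random matrices are used so that the transposes are actually exercised.

```python
import numpy as np

def objective(A, B, P, m):
    """trace A^T (I (+) P) B (I (+) P^T), computed without block expansion."""
    M = np.eye(A.shape[0])
    M[m:, m:] = P
    return np.trace(A.T @ M @ B @ M.T)

def gradient(A, B, P, m):
    """Block formula: A21 B21^T + A12^T B12 + A22 P B22^T + A22^T P B22."""
    A12, A21, A22 = A[:m, m:], A[m:, :m], A[m:, m:]
    B12, B21, B22 = B[:m, m:], B[m:, :m], B[m:, m:]
    return A21 @ B21.T + A12.T @ B12 + A22 @ P @ B22.T + A22.T @ P @ B22

def numerical_gradient(A, B, P, m, eps=1e-6):
    """Central finite differences of the objective with respect to each entry of P."""
    G = np.zeros_like(P)
    for i in range(P.shape[0]):
        for j in range(P.shape[1]):
            E = np.zeros_like(P)
            E[i, j] = eps
            G[i, j] = (objective(A, B, P + E, m) - objective(A, B, P - E, m)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
m, n = 2, 4
A = rng.normal(size=(m + n, m + n))
B = rng.normal(size=(m + n, m + n))
P = rng.random((n, n))
max_gap = np.max(np.abs(gradient(A, B, P, m) - numerical_gradient(A, B, P, m)))
print(max_gap)  # expected to be tiny (rounding error only), since f is quadratic in P
```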



We start the Frank-Wolfe Algorithm at the doubly stochastic matrix $\tilde{P} = \frac{1}{n} 1_n 1_n^T$, the n-by-n
matrix with every entry equal to $1/n$. (This is only for simplicity, and any other choice of doubly
stochastic $\tilde{P}$ might be as effective.) In the
next paragraph we describe a single step in the Frank-Wolfe algorithm. Such steps are repeated
iteratively until the iterates empirically converge.

Given any particular doubly stochastic matrix $\tilde{P} \in \mathbb{R}^{n\times n}$ the Frank-Wolfe-step linearization
involves maximizing $\mathrm{trace}\, Q^T \nabla(\tilde{P})$ over all of the doubly stochastic matrices $Q \in \mathbb{R}^{n\times n}$. This is
precisely the linear assignment problem (since it is not hard to show that the optimal doubly
stochastic $Q$ can in fact be selected to be a permutation matrix) and so the Hungarian Algorithm
will in fact find the optimal $Q$, call it $\tilde{Q}$, in $O(n^3)$ time. The next task in the Frank-Wolfe
algorithm step will be maximizing the objective function over the line segment from $\tilde{P}$ to $\tilde{Q}$; i.e.,
maximizing $g(\alpha) := f(\alpha\tilde{P} + (1-\alpha)\tilde{Q})$ over $\alpha \in [0,1]$. Denote
$c := \mathrm{trace}\, A_{22}^T \tilde{P} B_{22} \tilde{P}^T$ and
$d := \mathrm{trace}(A_{22}^T \tilde{P} B_{22} \tilde{Q}^T + A_{22}^T \tilde{Q} B_{22} \tilde{P}^T)$ and
$e := \mathrm{trace}\, A_{22}^T \tilde{Q} B_{22} \tilde{Q}^T$ and
$u := \mathrm{trace}(\tilde{P}^T A_{21} B_{21}^T + \tilde{P}^T A_{12}^T B_{12})$ and
$v := \mathrm{trace}(\tilde{Q}^T A_{21} B_{21}^T + \tilde{Q}^T A_{12}^T B_{12})$.
Then (ignoring the additive constant $\mathrm{trace}\, A_{11}^T B_{11}$ without
loss of generality, since it will not affect the maximization) we have
$g(\alpha) = c\alpha^2 + d\alpha(1-\alpha) + e(1-\alpha)^2 + u\alpha + v(1-\alpha)$,
which simplifies to $g(\alpha) = (c - d + e)\alpha^2 + (d - 2e + u - v)\alpha + (e + v)$. Setting
the derivative of $g$ to zero yields potential critical point $\tilde{\alpha} := \frac{-(d - 2e + u - v)}{2(c - d + e)}$ (if indeed $0 \leq \tilde{\alpha} \leq 1$);
thus the next Frank-Wolfe algorithm iterate will either be $\tilde{P}$ (in which case the algorithm would halt)
or $\tilde{Q}$ or $\tilde{\alpha}\tilde{P} + (1 - \tilde{\alpha})\tilde{Q}$, and the objective functions can be compared to decide which of these
three matrices will be the $\tilde{P}$ of the next Frank-Wolfe step.
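
A single step of this kind might be sketched as follows. This is an illustration rather than the authors' implementation: the function name `frank_wolfe_step` is ours, the seeds are assumed to be the first m rows and columns, and SciPy's `linear_sum_assignment` stands in for the Hungarian Algorithm as the linear assignment solver.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def frank_wolfe_step(A, B, P, m):
    """One Frank-Wolfe step for the relaxed seeded problem; returns (next P, alpha)."""
    A12, A21, A22 = A[:m, m:], A[m:, :m], A[m:, m:]
    B12, B21, B22 = B[:m, m:], B[m:, :m], B[m:, m:]
    const = A21 @ B21.T + A12.T @ B12              # gradient terms that do not involve P
    grad = const + A22 @ P @ B22.T + A22.T @ P @ B22

    # Linearization: maximize trace(Q^T grad) over permutation matrices Q.
    rows, cols = linear_sum_assignment(grad, maximize=True)
    Q = np.zeros(P.shape)
    Q[rows, cols] = 1.0

    # Coefficients of g(alpha) = f(alpha * P + (1 - alpha) * Q), additive constant dropped.
    c = np.trace(A22.T @ P @ B22 @ P.T)
    d = np.trace(A22.T @ P @ B22 @ Q.T + A22.T @ Q @ B22 @ P.T)
    e = np.trace(A22.T @ Q @ B22 @ Q.T)
    u = np.trace(P.T @ const)
    v = np.trace(Q.T @ const)
    quad, lin = c - d + e, d - 2 * e + u - v

    def g(alpha):
        return quad * alpha ** 2 + lin * alpha + (e + v)

    candidates = [1.0, 0.0]                         # alpha = 1 keeps P; alpha = 0 moves to Q
    if abs(quad) > 1e-12:
        critical = -lin / (2 * quad)
        if 0.0 < critical < 1.0:
            candidates.append(critical)
    alpha = max(candidates, key=g)
    return alpha * P + (1 - alpha) * Q, alpha
```

Comparing the three candidate values of $g$ directly avoids having to reason about whether the quadratic is concave or convex along the segment.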

At the termination of the Frank-Wolfe Algorithm, we have an approximate solution, say the
doubly stochastic matrix $\tilde{P}$, to the problem maximize $\mathrm{trace}\, A^T(I_{m\times m} \oplus P)B(I_{m\times m} \oplus P^T)$ subject
to $P$ being a doubly stochastic matrix.

2.3 Projecting the approximate solution of the relaxed problem

After the termination of the Frank-Wolfe Algorithm with $\tilde{P}$, what if $\tilde{P}$ is not a permutation
matrix? How do we get out of $\tilde{P}$ a meaningful approximate solution to the seeded graph matching
problem? The answer is that we will do one more step; we will find the permutation matrix $\hat{Q}$
which solves the optimization problem min $\|Q - \tilde{P}\|_1$ subject to $Q$ being a permutation matrix,
and finally $\phi_{\hat{Q}}$ is our approximate seeded graph matching solution. To solve this latter optimization
problem, observe that for any permutation matrix $Q$ (with all sums over $i,j \in \{1, 2, \ldots, n\}$)

$$\begin{aligned}
\|Q - \tilde{P}\|_1 &= \sum_{i,j \,:\, q_{ij} \neq 1} \tilde{p}_{ij} + \sum_{i,j \,:\, q_{ij} = 1} (1 - \tilde{p}_{ij}) \\
&= \sum_{i,j} \tilde{p}_{ij} + \sum_{i,j \,:\, q_{ij} = 1} 1 - 2 \sum_{i,j \,:\, q_{ij} = 1} \tilde{p}_{ij} \\
&= n + n - 2 \sum_{i,j \,:\, q_{ij} = 1} \tilde{p}_{ij} \\
&= 2n - 2\, \mathrm{trace}\, Q^T \tilde{P}.
\end{aligned}$$

Thus, minimizing $\|Q - \tilde{P}\|_1$ subject to $Q$ being a permutation matrix is equivalent to maximizing
$\mathrm{trace}\, Q^T \tilde{P}$ subject to $Q$ being a permutation matrix; this latter optimization formulation is
precisely a formulation of the well-known linear assignment problem, and it is efficiently solvable in
$O(n^3)$ time with the Hungarian Algorithm. In this manner we can efficiently obtain $\phi_{\hat{Q}}$, which is
our approximate seeded graph matching solution.
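
The projection step reduces to one call to a linear assignment solver. A minimal sketch, assuming SciPy's `linear_sum_assignment` as the solver in place of the Hungarian Algorithm and a hypothetical function name:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def project_to_permutation(P):
    """Permutation matrix Q maximizing trace(Q^T P), i.e. minimizing ||Q - P||_1
    when P is doubly stochastic."""
    P = np.asarray(P, dtype=float)
    rows, cols = linear_sum_assignment(P, maximize=True)
    Q = np.zeros(P.shape)
    Q[rows, cols] = 1.0
    return Q

P = np.array([[0.7, 0.3],
              [0.3, 0.7]])
Q = project_to_permutation(P)
l1_distance = np.abs(Q - P).sum()
identity_check = 2 * P.shape[0] - 2 * np.trace(Q.T @ P)   # the 2n - 2 trace(Q^T P) identity
print(Q, l1_distance, identity_check)
```

For this small example the nearest permutation matrix is the identity, and both ways of computing the distance give 1.2 up to floating-point rounding.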



2.4 Modified-FAQ is fast and accurate

By limiting the number of Frank-Wolfe steps to a constant, the running time of modified-FAQ is
cubic in the number of vertices, since that is the complexity of the Hungarian Algorithm. Since
there is no appreciable difference in running time between modified-FAQ and FAQ, we have
state-of-the-art running time, in practice, as reported for FAQ in [7]. In addition, FAQ finds the optimal
solution for several of the benchmarks considered in [7].
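
Putting the pieces of this section together, the whole procedure with a constant cap on the number of Frank-Wolfe steps might look like the sketch below. It is an illustration under stated assumptions, not the authors' code: the name `seeded_faq_sketch` and the default `max_steps=30` are ours, seeds are the first m vertices matched by the identity, and SciPy's `linear_sum_assignment` replaces the Hungarian Algorithm. Each pass through the loop costs a fixed number of n-by-n matrix products plus one linear assignment problem.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def _best_permutation(M):
    """Permutation matrix Q maximizing trace(Q^T M)."""
    rows, cols = linear_sum_assignment(M, maximize=True)
    Q = np.zeros(M.shape)
    Q[rows, cols] = 1.0
    return Q

def seeded_faq_sketch(A, B, m, max_steps=30):
    """Return phi (0-based array) with phi[i] = i for the m seeds."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    n = A.shape[0] - m
    A12, A21, A22 = A[:m, m:], A[m:, :m], A[m:, m:]
    B12, B21, B22 = B[:m, m:], B[m:, :m], B[m:, m:]
    const = A21 @ B21.T + A12.T @ B12          # seed-to-nonseed part of the gradient
    P = np.full((n, n), 1.0 / n)               # flat doubly stochastic starting point
    for _ in range(max_steps):                 # constant cap on Frank-Wolfe steps
        grad = const + A22 @ P @ B22.T + A22.T @ P @ B22
        Q = _best_permutation(grad)
        c = np.trace(A22.T @ P @ B22 @ P.T)
        d = np.trace(A22.T @ P @ B22 @ Q.T + A22.T @ Q @ B22 @ P.T)
        e = np.trace(A22.T @ Q @ B22 @ Q.T)
        u = np.trace(P.T @ const)
        v = np.trace(Q.T @ const)
        quad, lin = c - d + e, d - 2 * e + u - v
        candidates = [1.0, 0.0]
        if abs(quad) > 1e-12 and 0.0 < -lin / (2 * quad) < 1.0:
            candidates.append(-lin / (2 * quad))
        alpha = max(candidates, key=lambda a: quad * a * a + lin * a)
        if alpha == 1.0:                       # staying at P: the iteration halts
            break
        P = alpha * P + (1 - alpha) * Q
    Q_hat = _best_permutation(P)               # projection onto permutation matrices
    return np.concatenate([np.arange(m), m + np.argmax(Q_hat, axis=1)])

rng = np.random.default_rng(1)
U = np.triu(rng.integers(0, 2, (10, 10)), 1)
A = U + U.T
phi = seeded_faq_sketch(A, A, m=4)
print(phi)
```

The output is always a bijection that fixes the seeds; how close it comes to the true correspondence depends on the graphs and on the number of seeds.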



3 Demonstrations 

We demonstrate the effectiveness of our fast approximate seeded graph matching algorithm via
a simple but illustrative simulation study and three real data experiments (see footnote 2). The potential for
dramatic performance improvement from incorporating just a few seeds is undeniable.

2 The adjacency matrices for our three experiments are available at http://www.cis.jhu.edu/~parky/SGM

In all four examples, we increase the number of seeds $m$ from zero to some substantial fraction
of the (fixed) total number of vertices in the graphs, $c$, and attempt to match the remaining $n$
vertices ($n + m = c$). (For the Wikipedia graphs in Section 3.2, $c = 1382$; for the Enron email
graphs in Section 3.3, $c = 184$; for the C. elegans nervous system graphs in Section 3.4, $c = 279$.
For the simulation in Section 3.1 we consider $c = 300$.) We report performance as the fraction of
unseeded vertices correctly matched - where $\phi^{(m)}$ agrees with correspondence function $\Psi$. That
is, the match ratio is given by

$$\delta^{(m)} = \frac{\left|\left\{v \in V_1 \setminus W_1^{(m)} : \phi^{(m)}(v) = \Psi(v)\right\}\right|}{n}.$$

The expected number of vertices for which a randomly chosen bijection $V_1 \to V_2$ agrees with $\Psi$ is
1. For a given value of $m$, we need to match only the remaining $n = c - m$ vertices; thus chance
performance $1/n = 1/(c - m)$ increases as $m$ increases. In all cases, we observe that $\delta^{(m)}$ increases
much faster than chance.
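
The match ratio and its chance baseline are simple to compute. A minimal sketch, assuming (as in Section 2.1) that the seeds are the first m vertices, that bijections are stored as 0-based integer arrays, and using hypothetical function names:

```python
import numpy as np

def match_ratio(phi, truth, m):
    """Fraction of the n = c - m unseeded vertices on which phi agrees with truth."""
    phi = np.asarray(phi)
    truth = np.asarray(truth)
    n = len(phi) - m
    return float(np.sum(phi[m:] == truth[m:])) / n

def chance_ratio(c, m):
    """Expected match ratio of a uniformly random bijection on the unseeded vertices."""
    return 1.0 / (c - m)

truth = np.array([0, 1, 2, 3, 4])
phi = np.array([0, 1, 2, 4, 3])       # two seeds; one of three unseeded vertices is correct
print(match_ratio(phi, truth, m=2))   # 0.333...
print(chance_ratio(c=1382, m=50))     # 1/1332
```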



3.1 Simulation 

Here we present a simple but illustrative simulation study, where the graphs are (dependent)
Erdős-Rényi. We must specify a joint probability model for the pair $(G_1, G_2)$. We use $G_1 \sim ER(c =
n+m, p)$; $G_2$ is obtained by flipping bits in $G_1$ according to the "perturbation parameter" $\rho \in [0, 1]$:
given that edge $uv \in E_1$, we let $uv \in E_2 \sim$ Bernoulli$(1 - \rho)$; given that edge $uv \notin E_1$, we let
$uv \in E_2 \sim$ Bernoulli$(\rho)$. Thus $\rho = 0$ means $G_2$ is identical to $G_1$ and we can hope for best case
performance of $n$ correct matches recovered, while $\rho = 1/2$ means $G_2$ is independent of $G_1$ and
we expect chance performance of 1 correct match recovered. We consider $c = 300$ and $p = 1/2$.
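
The joint model above amounts to flipping each potential edge of $G_1$ independently with the perturbation probability. A minimal sketch of one way to sample such a pair, where the function name `correlated_er_pair` and the use of a NumPy `Generator` are our choices:

```python
import numpy as np

def correlated_er_pair(c, p, rho, rng):
    """Sample adjacency matrices (A, B): A is ER(c, p); B flips each bit of A with probability rho."""
    iu = np.triu_indices(c, k=1)               # one entry per unordered vertex pair
    edges1 = rng.random(len(iu[0])) < p        # edges of G1
    flips = rng.random(len(iu[0])) < rho       # which pairs get flipped in G2
    edges2 = np.logical_xor(edges1, flips)     # present w.p. 1 - rho if in G1, rho otherwise
    A = np.zeros((c, c), dtype=int)
    B = np.zeros((c, c), dtype=int)
    A[iu] = edges1
    B[iu] = edges2
    return A + A.T, B + B.T                    # symmetric and hollow

rng = np.random.default_rng(0)
A, B = correlated_er_pair(c=300, p=0.5, rho=0.1, rng=rng)
print(np.mean(A != B))   # roughly rho, up to sampling noise and the zero diagonal
```

To obtain a matching problem from such a pair, one would additionally relabel the vertices of the second graph by a random permutation, which plays the role of the correspondence function.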



Figure 1: Matching simulated graphs, plotting match ratio $\delta^{(m)}$ against the number of seeds $m$
for various degrees of dependency between graphs. The perturbation parameter $\rho$ increases from
0 to 0.5 (from dark blue to red) in increments of 0.05, with performance decreasing monotonically
as $\rho$ increases. The right plot is a zoom-in, showing details for small $m$. NB: $\rho = 0$ gives perfect
matching, even for $m = 0$.

The graphs $G_1$ and $G_2$ are matched using modified-FAQ. Figure 1 plots the mean and standard
error of the match ratio $\delta^{(m)}$ against the number of seeds $m$, based on 400 randomly chosen seed
sets $W_1$ for each $m$, for perturbation parameter $\rho$ increasing from 0 to 0.5 in increments of 0.05.
(Chance is plotted in black, but is indistinguishable from $\rho = 0.5$ (red).) Notice that $\delta^{(m)}$ increases
quickly as $m$ increases and decreases as $\rho$ increases, as expected.

We note that perfect performance when $\rho = 0$ - the darkest blue line in Figure 1 shows that
the match ratio $\delta^{(m)} = 1$ even for $m = 0$ - indicates that, in these simulations, modified-FAQ finds
the isomorphism when it exists.

3.2 Wikipedia 

Wikipedia is an online editable encyclopedia with 22 million articles (more than 4 million articles
in English) in 285 languages. A collection of $c = 1382$ English articles was collected by crawling
the (directed) 2-neighborhood of the document "Algebraic Geometry" using intra-language links
from one English article to another. This first graph will be made a simple undirected graph
by symmetrizing its adjacency matrix. In Wikipedia, inter-language links between articles of the
same topic in different languages are available; thus, 1-1 correspondence information between the
vertices of this English Wikipedia subgraph and some vertices of the French Wikipedia graph
is available. Corresponding articles in French were collected and their intra-language links yield
a second graph (not necessarily connected) which is also symmetrized (see footnote 3). Following the notation
in previous sections, the English Wikipedia subgraph is denoted $G_1$ and the French Wikipedia
subgraph induced by the correspondents of the English Wikipedia articles is denoted $G_2$.

Figure 2: Matching French & English Wikipedia graphs, plotting match ratio $\delta^{(m)}$ against the
number of seeds $m$.

3 This data set was collected by Dr. David J. Marchette. 

The English and French Wikipedia subgraphs $G_1$ and $G_2$ are matched using modified-FAQ.
Figure 2 plots, in red, the mean and standard error of the match ratio $\delta^{(m)}$ against the number of
seeds $m$, based on 100 randomly chosen seed sets for each $m$. (Chance is plotted in black.)

We see dramatic performance improvement from incorporating just a few seeds: with no seeds
$\delta^{(0)} \approx 1/100$ (chance is 1/1382), while with just $m = 50$ seeds $\delta^{(50)} > 1/2$ (chance is 1/1332).

The blue curve in Figure 2 shows the match ratio for the unseeded problem on $c - m$ vertices.
While the problem becomes smaller as $m$ increases, performance does not improve appreciably in
terms of match ratio.

3.3 Enron 

The Enron email corpus consists of messages amongst $c = 184$ employees of the Enron Corporation.
Publicly available emails are used to compute a time series of graphs $\{G_t : t = 1, \ldots, T\}$ on the
actors, where each graph represents one week of emails. The inference task is to identify "chatter"
anomalies - small groups of actors whose activity amongst themselves increases significantly for
some week $t$. Previous work identified such an anomaly at week $t = 132$ (see [3]).



Figure 3: Matching Enron email graphs, plotting match ratio $\delta^{(m)}$ against the number of seeds $m$.

The Enron email graphs for consecutive weeks $t = 130, 131, 132$ are matched, one pair at a
time, using modified-FAQ. Figure 3 plots, for each pair, the mean and standard error of the match
ratio $\delta^{(m)}$ against the number of seeds $m$, based on 100 randomly chosen seed sets for each
$m$. (Chance is plotted in black.) The results are consistent with the finding reported in [3]:
the match ratio $\delta^{(m)}$ is much higher between graphs for weeks $t = 130, 131$, where there was no
significant change, compared to matching across the change (between $t = 131, 132$ and between
$t = 130, 132$). The anomalous event at week $t = 132$ makes the graphs more different and the
graph matching more difficult. Indeed, investigation shows that the difference in performance is
largely attributable to the vertices participating in the anomaly, as reported in [3].

3.4 C. elegans 

C. elegans is a roundworm that has been extensively studied; its particular usefulness comes from
its simple nervous system, consisting of $c = 279$ neurons whose connections have been mapped [6].
There are two types of connections between neurons: chemical (chemical synapses) and electrical
(junction potentials). The adjacency matrices for both graphs are sparse. Both $G_1$ and $G_2$ are
weighted graphs; for sake of uniformity with our other examples, the adjacency matrices are
binarized and symmetrized.



Figure 4: Matching Chemical and Electrical connectivity graphs of C. elegans nervous system,
plotting match ratio $\delta^{(m)}$ against the number of seeds $m$.

The objective of this experiment is to match the chemical graph $G_1$ to the electrical graph
$G_2$, using modified-FAQ. Figure 4 plots, in red, the mean and standard error of the match ratio
$\delta^{(m)}$ against the number of seeds $m$, based on 100 randomly chosen seed sets for each $m$.
(Chance is plotted in black.) Although performance improves when incorporating seeds, the match
ratio for this experiment is significantly lower than for either Wikipedia or Enron. For instance,
with $m = 200$ seeds, the remaining $n = 79$ vertices are matched with $\delta^{(200)} \approx 0.15$ (chance is
1/79). This suggests that the similarity between the two types of brain graphs, while significant,
is tenuous.

4 Discussion



Many graph inference tasks are more easily accomplished if the graphs under consideration are
labeled - if we know the correspondence between vertices in graphs $G_1$ and $G_2$. We have
demonstrated, via a simple but illustrative simulation and three real data experiments, the potential for
dramatic performance improvement in identifying this correspondence from incorporating just a
few seeds.

In practice, identifying seeds $W_1$, $W_2$, and their bijection $\psi$ may be costly. Thus,
understanding the cost-benefit trade-off between inference without correspondence and inference performed
subsequent to seeded graph matching is essential. This paper provides the foundation for that
analysis.

(Note that once the value of a few seeds is accepted, it seems clear that there will be a demand 
for an active learning methodology to identify the most cost-effective vertices to use as seeds.) 

As noted above, our methodology applies immediately in the broader setting where adjacency
matrices are generic (nonsymmetric, nonhollow, and/or nonintegral); that is, to weighted, directed,
loopy graphs. Figure 5 provides results for matching the C. elegans graphs of Section 3.4 but
in the case where the adjacency matrices have not been binarized and symmetrized. Comparing
results for the original graphs vs. their simplified versions, we see that the addition of edge weights
actually degrades performance significantly, suggesting that the edge weights in this case are not
consistent across the two modalities.



Figure 5: Matching the original weighted directed loopy Chemical and Electrical connectivity
graphs of C. elegans nervous system (blue), compared to the case where the adjacency matrices
have been binarized and symmetrized (red).

Obvious extensions to this work include: (a) the case where $|V_1| \neq |V_2|$ - say, $V_1 \subset V_2$; (b)
the case where the correspondence may be many-to-many; and (c) the case where the seeds are
"soft" rather than "hard" - that is, we know that it is likely (but not certain) that the bijection
$\psi$ between seed sets $W_1$ and $W_2$ holds. Each of these extensions can be addressed within the
framework presented here.

In conclusion, we contend that the methodology presented herein forms the foundation for 
improving performance in myriad graph inference applications for which there exists a partially 
unknown correspondence between the vertices of various graphs. 